1. Optimization Theory

Overview of Optimization Theory and the Four Main Types of Optimization Problems

Optimization theory is a branch of mathematics devoted to solving optimization problems: problems in which we want to minimize or maximize the value of a mathematical function. Such problems appear throughout computer science and applied mathematics. Finding good solutions to them is what gives data science and machine learning models their accuracy, since those models are themselves built on numerical methods for minimizing error functions. Common optimization problems in machine learning include minimizing MSE, MAE, cross-entropy, etc.

Table of Contents

Overview

In brief, optimization theory can be reduced to three questions:

  1. What is the Objective Function?
  2. What are the Set of Variables of Interest?
  3. What are the Set of Constraints?

Knowing what your objective function looks like in terms of dimensionality, domain, and number of local extrema is vital when choosing an algorithm. In addition, it is necessary to understand the data type of each variable: discrete or continuous? Lastly, knowing the set of constraints is important so that your algorithm searches the domain space for feasible solutions only.

If you remember from calculus, optima, otherwise known as critical points, are points in the domain where the slope of the function is zero. These points are commonly referred to as extrema and are classified as either minima or maxima. Extrema can be classified into four main types:

  1. Weak Extrema
  2. Strong Extrema
  3. Global Extrema
  4. Saddle Points

U1_1.png

Weak extrema are critical points whose neighboring points have the same function value, creating a plateau.

Strong Extrema are critical points that have either the largest (for maximization) or smallest (for minimization) function value among their neighboring points, creating a valley or mountain.

Global Extrema are Strong Extrema whose function value is the largest (for maximization) or smallest (for minimization) among all Strong Extrema.

Saddle Points (not discussed in this course) are inflection points where the function changes between concave up and concave down.

There are four main types of optimization problems:

Unconstrained Problems

Unconstrained Problems are just what they sound like: unconstrained. These are functions we want to minimize or maximize over a set of variables whose domains are continuous, discrete, or a mix of both. Here we have a definition:

DEF: Unconstrained Problem:

  • minimize $f(x)$, $x=(x_1,x_2,\dots,x_n)$;
  • subject to $x_j\in domain(x_j)$.
  • Where $n$ denotes the number of variables.
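As an illustrative sketch (the quadratic objective and the gradient-descent method are my own choices, not from the course), here is a minimal unconstrained minimization in Python:

```python
def f(x, y):
    # Illustrative objective (my own choice): a bowl with minimum at (1, -2)
    return (x - 1) ** 2 + (y + 2) ** 2

def grad_f(x, y):
    # Analytic gradient of the bowl above
    return 2 * (x - 1), 2 * (y + 2)

def gradient_descent(x, y, lr=0.1, steps=200):
    # Unconstrained minimization: repeatedly step against the gradient
    for _ in range(steps):
        gx, gy = grad_f(x, y)
        x, y = x - lr * gx, y - lr * gy
    return x, y

x_min, y_min = gradient_descent(0.0, 0.0)
```

Here every point in the plane is feasible, which is exactly what makes the problem unconstrained.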

Constrained Problems

Constrained Problems have the same setup as Unconstrained Problems but place constraints on the input space. Here we have a definition:

DEF: Constrained Problem:

  • minimize $f(x)$, $x=(x_1,x_2,\dots,x_n)$
  • subject to $x_j\in domain(x_j)$, $g_m(x)\leq 0$, $m=1,\dots,n_g$.
  • Where $n$ denotes the number of variables, $g_m(x)$ denotes a constraint, and $n_g$ denotes the number of constraints.

Example

First, we have the objective function, denoted by A. The global unconstrained minimum sits at the bottom right, at point E. However, we introduce an inequality constraint, denoted by line D, where any value in the checkered region is infeasible. Thus, we are left with only part of the domain to work with. Points B and C are two critical points in the feasible domain space, but the next global minimum in the constrained environment is at point F. In scenarios such as these, we would like our algorithm to steer away from the infeasible domain space and find the next best minimum value.

U1_4.png

An example of the interaction of the domain space between two variables in a constrained environment.

Here we can see that X and Y are continuous variables ranging between 0 and 2. However, we add a constraint such that as X tends towards 2, Y must decrease, and as Y tends towards 2, X must decrease. This can be seen in the blue semi-ring, which represents the feasible input space; the transparent background represents the infeasible domain space.

U1_5.png

Multi-Solution Problems

Multi-Solution Problems are functions where there are no unique global extrema; instead there are tied global solutions. In these instances we want to return all the tied global solutions (hence the name multi-solution) to give us a range of options to choose from.

In a real application, we might want to minimize some type of cost function and find that multiple solutions exist. Our goal is to return these multiple solutions so that we can choose the point, or vector of variable values, that best fits our interest. Here we have an exact definition:

DEF: Multi Solution Problem: Find a set of solutions, $X=\{x_1,\dots,x_n\}$, such that each $x_i$ is a global extremum of the optimization problem.
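A minimal sketch of what "return all tied global solutions" can mean in code, using an assumed finite set of candidate points and an illustrative tolerance:

```python
import math

def tied_global_minima(f, xs, tol=1e-9):
    # Evaluate f at every candidate point and return each x whose value
    # ties the global minimum (within a small tolerance).
    values = [f(x) for x in xs]
    best = min(values)
    return [x for x, v in zip(xs, values) if v - best <= tol]

# sin(x) has tied global minima at 3*pi/2 and 7*pi/2 on [0, 4*pi]
candidates = [3 * math.pi / 2, 7 * math.pi / 2, 0.0, math.pi]
mins = tied_global_minima(math.sin, candidates)
```

In practice the candidate set would come from the search algorithm itself rather than being listed by hand.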

Example

It is a variation of the sine wave function. As we can see, we have tied global minima and maxima for this problem. Note, however, that not all multi-solution problems are periodic like a sine wave, where it is easy to predict the next value; in general, the tied global extrema are scattered throughout the domain space without pattern.

U1_7.png

Multi-Objective Problems

Multi-Objective Problems (MOP) are problems where we have many different objectives, or optimization problems, that we need to solve simultaneously. There are two main methods for solving these types of problems:

  1. Weighted Aggregation
  2. Pareto Optimality

Weighted Aggregation is simply an aggregate of all the objective functions. We sum up each objective function, multiplied by an associated weight value, and try to minimize or maximize that sum. It is usually assumed that the weights sum to one. Here we have an example aggregate function where we want to simultaneously minimize F1 and F2. Note that each function has an associated weight value, where larger weights indicate greater influence.

$$F(x)=0.4f_1(x)+0.6f_2(x)$$

DEF: Multi Objective Problem (Weighted Aggregation):

  • minimize $F(x)$, $F(x)=\sum_{k=1}^{n_k}w_k f_k(x)$;
  • subject to $x_j\in domain(x_j)$, $g_m(x)\leq 0$, $m=1,\dots,n_g$, $w_k\geq 0$, $k=1,\dots,n_k$.
  • Where $n$ denotes the number of variables, $g_m(x)$ denotes a constraint, $n_g$ denotes the number of constraints, $n_k$ denotes the number of objectives, and usually $\sum_{k=1}^{n_k}w_k=1$.

Pareto Optimality is another way to solve MOP by classifying possible solutions in a dominance structure. Domination is a way to distinguish good solutions from bad. There are two main types of domination: Strong and Weak.

Strong domination occurs when one possible solution is equal to or better than another possible solution in all objectives, and strictly better in at least one objective.

DEF: Strong Domination (Pareto-Optimality): A decision vector $x_1$ strongly dominates another decision vector $x_2$ if and only if:

  • $x_1$ is equal to or better than $x_2$ in all objectives: $f_k(x_1)\leq f_k(x_2) \forall k=1,\dots,n_k$
  • $x_1$ is better than $x_2$ in at least one objective: $\exists k | f_k(x_1)<f_k(x_2)$

Weak domination, on the other hand, states that a possible solution is not worse than another solution in any objective.

DEF: Weak Domination (Pareto-Optimality): A decision vector $x_1$ weakly dominates another decision vector $x_2$ if and only if:

  • $x_1$ is not worse than $x_2$ in any objective: $f_k(x_1)\leq f_k(x_2)\ \forall k=1,\dots,n_k$
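The dominance definitions above can be sketched directly in Python, assuming minimization and objective vectors represented as plain lists (the function names are my own):

```python
def strongly_dominates(f1, f2):
    # f1, f2: objective vectors (minimization). x1 strongly dominates x2 iff
    # it is no worse in every objective and strictly better in at least one.
    no_worse = all(a <= b for a, b in zip(f1, f2))
    strictly_better = any(a < b for a, b in zip(f1, f2))
    return no_worse and strictly_better

def weakly_dominates(f1, f2):
    # x1 weakly dominates x2 iff it is not worse in any objective.
    return all(a <= b for a, b in zip(f1, f2))
```

For example, objective vector [1, 2] strongly dominates [2, 2], while [1, 3] and [2, 2] do not dominate each other.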

U1_5.png

U1_6.png

U1_7.png

Benchmarking Test Functions

Benchmarking test functions are mathematical optimization problems designed for testing algorithms. These are artificial problems with no direct application, designed to be as difficult as, or more difficult than, real-world applications. In the literature, instead of comparing algorithms on specific applications, which may require extensive domain knowledge, researchers often compare algorithms by how well they can find global extrema on these benchmarking test functions. There are many different types of benchmarking test functions; here are three commonly found in the literature:

U1_8.png

U1_9.png

U1_10.png
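The specific functions in the figures are not named in the text; as an assumption, here are two benchmarks commonly found in the literature, Sphere (unimodal) and Rastrigin (highly multimodal):

```python
import math

def sphere(x):
    # Unimodal benchmark: global minimum 0 at the origin.
    return sum(v * v for v in x)

def rastrigin(x):
    # Highly multimodal benchmark: global minimum 0 at the origin,
    # surrounded by a regular grid of local minima.
    return 10 * len(x) + sum(v * v - 10 * math.cos(2 * math.pi * v)
                             for v in x)
```

Both accept a point of any dimension as a list, which is convenient when testing how an algorithm scales with the number of variables.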

Conclusion

In summary, optimization problems can be classified into four categories: unconstrained problems, where the bounds are fixed and do not change; constrained problems, where the domain changes depending on variable values; multi-solution problems, where there is no unique global minimum/maximum and we want to return all tied solutions; and multi-objective problems, where we want to find the Pareto front.

Optimization problems are extremely abundant in numerous fields of research, from traffic scheduling, maximizing production over many variables, and efficient electricity use, to neural network training, game modeling, and optimizing data science models. The goal in these circumstances is to minimize or maximize the optimization function, where each optimum represents a possible solution. However, we do not want just a possible solution; we want the best solution, the global minimum or maximum. So how do we find these global extrema? We could use standard numerical methods such as Newton's method or conjugate gradient, or other optimization methods such as leapfrog, hill climbing, or simulated annealing. In this series we will cover evolutionary computation as a means to find these global extrema.

U1_11.png

Why would we want to use evolutionary computation in the first place over classical optimization techniques? The problem with classical optimization algorithms is usually that they require the objective function to be differentiable and tend to converge to the nearest local optimum rather than the global one.

Evolutionary computation, on the other hand, is derivative-free and population-based, allowing it to search many regions of the domain space at once and escape local optima.

Because of these reasons, evolutionary computation is commonly used on problems classified as NP-hard, such as traveling salesman or scheduling, or when the optimization function is not differentiable.

2. Introduction To Evolutionary Computation

Table of Contents

2.1 Biological Inspiration

Evolution can be described as a process by which individuals become ‘fitter’ in different environments through adaptation, natural selection, and selective breeding.

Here we have a picture of the famous finches Charles Darwin depicted in his journal while on the Galápagos Islands. He noted the extreme diversity of beak sizes within the species. This is an example of mutation.

U2_1.png

Here we have a picture of a trivial example of natural selection, adaptation, and selective breeding.

U2_2.png

In evolutionary computation, we try to model these principles to find the best solution to a problem. Each possible solution is represented as an individual in a population pool, and we perform adaptation, natural selection, and selective breeding on those possible solutions to ultimately find the best solution to the problem.

2.2 Overview of Evolutionary Algorithms

For a brief overview of the canonical genetic algorithm, see the diagram below.

U2_3.png

2.3 Representation of the Chromosome

Cells are one of the most basic building blocks of organisms. Each cell is built from strands of DNA. Chromosomes, the main structures of DNA, contain long strands of genes, where each gene represents some inherited characteristic. In EC, we model these characteristics in our optimization problems. An individual in EC is simply a point in the domain space of the optimization problem, where each variable value of the function is represented by a gene. Thus, each point is an individual, and each component of that point, i.e. each variable value, is encoded as a gene in the genome. The genotype describes the genetic makeup of an individual; the phenotype defines what the individual looks like in the environment.

2.4 Initial Population

Evolutionary Algorithms (EAs) are population-based search algorithms, meaning they work by taking a pool of initial points and searching from those points in parallel.

Unlike standard numerical methods, such as Newton's method, where you feed in only one initial value, EAs use the diversity of the population to search for better solutions. In addition, we need a way to represent the entire search space/domain of the problem: if our initial population includes only values from a centrally located part of the search space, then our algorithm will only find optimal solutions near that region.

Therefore, to ensure diversity, we need to randomly sample uniformly from our domain space. Now, whenever we are dealing with constrained problems where the bounds are different dependent upon variable values, our initial uniform sampling might lead to infeasible solutions. To get around this we either have to hard code this constraint for our uniform sampling or just sample over the entire input space but only keep feasible solutions.
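The sample-and-keep-feasible strategy just described might be sketched like this (the unit-circle constraint and all names are purely illustrative):

```python
import random

def initial_population(size, bounds, feasible=lambda x: True, seed=0):
    # Uniformly sample points within the per-variable bounds, keeping only
    # feasible ones (rejection sampling for constrained domains).
    rng = random.Random(seed)
    pop = []
    while len(pop) < size:
        x = [rng.uniform(lo, hi) for lo, hi in bounds]
        if feasible(x):
            pop.append(x)
    return pop

# Illustrative constraint: keep only points inside the unit circle
pop = initial_population(20, [(-1, 1), (-1, 1)],
                         feasible=lambda x: x[0] ** 2 + x[1] ** 2 <= 1)
```

Rejection sampling is simple but can be slow when the feasible region is a small fraction of the bounding box; in that case hard-coding the constraint into the sampler is the better option.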

Lastly, the choice of initial population size is crucial to the convergence and complexity of the algorithm: larger populations provide greater initial diversity, but each generation costs more to evaluate.

2.5 Choosing Fitness Function

In evolution, the motto is that only the best fit individuals in the environment survive. How do we determine which individuals are fit?

In EC, the fitness function is the optimization problem itself, where individuals with larger function values have better fitness. In the case of minimization, we can scale the function values so that smaller values yield larger fitness values. The choice of fitness function depends directly on the problem being optimized.

In optimization theory, there are four main types of problems: unconstrained, constrained, multi-objective, and multi-solution. EAs can work on all of these but may have difficulty with two: constrained and multi-solution.

1) Constrained problems: Because the domain is dynamic in a constrained environment, our algorithm may encounter infeasible solutions. To get around this, we have two main options: discard infeasible individuals, or penalize them with a penalty method.

DEF: Penalty Method: minimize $F(x)$, $$F(x)=f(x)+\lambda p(x)$$ where $p(x)$ denotes the penalty value of $x$ and $\lambda$ denotes a constant controlling the strength of the penalty.
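A minimal sketch of the penalty method, assuming constraints written in the standard $g_m(x)\leq 0$ form; the quadratic penalty $p(x)=\sum_m\max(0,g_m(x))^2$ used here is one common choice, not the only one:

```python
def penalized(f, constraints, lam=100.0):
    # Wrap objective f with a penalty: lam * p(x), where p(x) sums the
    # squared violations of the constraints g_m(x) <= 0.
    def F(x):
        p = sum(max(0.0, g(x)) ** 2 for g in constraints)
        return f(x) + lam * p
    return F

f = lambda x: x[0] ** 2
g = lambda x: 1 - x[0]      # constraint x >= 1, written as g(x) <= 0
F = penalized(f, [g])
```

Feasible points pay no penalty (F([2.0]) is just f([2.0]) = 4), while infeasible points are pushed away in proportion to how badly they violate the constraint.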

2) Multi-solution problems: EAs tend to converge to a single point or a small group of points, so they can struggle when many tied solutions need to be returned.

2.6 Selection

Once we have our initial population and compute the fitness of each individual, we need a way to determine who will be selected for survival and reproduction.

In EAs, there is a trade-off between exploitation and exploration. 1) Exploitation refers to the convergence quality of the algorithm. If we use a selection scheme that always chooses the fittest individuals to survive, we exhibit high exploitation but low exploration. High exploitation leads to quick convergence, but it can also lead to premature convergence.

U2_12.png

For example, in the picture above (assuming maximization), the global maximum is at B; if that region has not been explored by our algorithm, it might prematurely converge to A or C, depending on how much of the input space has been explored. In the picture, however, the initial point lies between A and B, so it will converge to A or B depending on random chance in the algorithm.

2) Exploration, in opposition to exploitation, does not always choose the fittest individuals to mate, encouraging exploration of the input space. However, by encouraging exploration, the algorithm can have poor convergence qualities.

In practice, it is sometimes best to utilize both, exploration in the beginning and then exploitation towards the end.

Selective pressure refers to how long it takes to produce a uniform population: high selective pressure decreases diversity, whereas low selective pressure encourages exploration.

There are many selection methods, but here are the most popular:

2.6.1 Random Selection

Random selection is a simple method with good exploration qualities, as it chooses the surviving individuals at random; however, because of this, it does not guarantee that the current best solution will survive, so good solutions can be lost. It is most commonly paired with elitism, as we will see in a moment.

U2_13.png

2.6.2 Proportional Selection (Roulette Wheel)

Proportional selection creates a probability distribution from the fitness values by dividing each fitness value by the sum of all fitness values. Because selection is probabilistic, it ensures good diversity while also ensuring good convergence; however, because an individual's probability is its share of total fitness, individuals with large fitness values early on can dominate the selection process.

U2_14.png
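A minimal roulette-wheel sketch, assuming non-negative fitness values (the population, fitness values, and seed are illustrative):

```python
import random

rng = random.Random(0)

def roulette_select(population, fitness):
    # Proportional (roulette wheel) selection: an individual's chance of
    # being picked is its fitness divided by the total population fitness.
    total = sum(fitness)
    r = rng.uniform(0, total)
    running = 0.0
    for individual, fit in zip(population, fitness):
        running += fit
        if r <= running:
            return individual
    return population[-1]

pop = ["a", "b", "c"]
fits = [1.0, 1.0, 8.0]
picks = [roulette_select(pop, fits) for _ in range(1000)]
```

With these fitness values, "c" holds 80% of the wheel, so it dominates the picks, which is exactly the large-fitness-domination effect described above.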

2.6.3 Rank Based Selection (Roulette Wheel)

To alleviate the problem of large fitness values dominating the reproduction space in proportional selection, rank-based selection has been proposed. It is like proportional selection, except the probability of being chosen is based not on the raw fitness value but on the individual's rank. Individuals are ranked from worst to best, and selection probability grows with rank, either linearly or nonlinearly. In the figure, we have an example of rank-based selection where the best is ranked highest and the worst lowest. The exact ranking of an individual depends on which rank-based selection equation you use.

U2_15.png

2.6.4 Tournament Selection

Tournament Selection, on the other hand, randomly selects a group of individuals, whose count is known as the tournament size, and the solution with the best fitness value in the tournament is chosen to survive. This ensures good convergence quality, as the best individual of each tournament is chosen, while retaining diversity through the tournament size. Note that a tournament size of 1 is essentially equivalent to random selection; smaller tournament sizes ensure good diversity, while larger tournament sizes make it more likely that the fittest individuals are chosen.

U2_16.png
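Tournament selection might be sketched as follows (names are my own; note that with a tournament size equal to the population size, the fittest individual always wins):

```python
import random

rng = random.Random(1)

def tournament_select(population, fitness, k=3):
    # Tournament selection: draw k distinct individuals at random and keep
    # the fittest. Small k behaves like random selection; large k strongly
    # favors the best individual.
    contenders = rng.sample(range(len(population)), k)
    best = max(contenders, key=lambda i: fitness[i])
    return population[best]

pop = [0, 1, 2, 3, 4]
fits = [0.0, 1.0, 2.0, 3.0, 4.0]
winner = tournament_select(pop, fits, k=5)
```

Because k equals the population size here, the whole population enters the tournament and the fittest individual (index 4) is guaranteed to win.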

2.6.5 Boltzmann Selection

Boltzmann Selection is another form of selection where the probability of being chosen is based on a simulated annealing equation. We will cover Boltzmann selection in the next unit when we discuss how to choose between the offspring and the parents for survival.

2.6.6 Mechanisms used alongside selection

1) Elitism simply takes the best percentage of the current population and always carries it over to the next generation. This guarantees good convergence, but can also lead to poor diversity if the elitism proportion is large. Elitism is usually paired with random selection or any of the other techniques to get a mixture of good diversity and convergence.

2) Hall of Fame is a mechanism where the best individual of the population is set aside in the Hall of Fame, where after the algorithm has stopped, the best solution from this hall of fame is chosen. By doing this, it allows the algorithm to favor good exploration with random selection without the fear of losing the best individual.

Note that all of these methods select probabilistically, so an individual can be selected multiple times for survival and reproduction.

2.7 Reproduction

Now that we’ve selected which individuals will reproduce, we need a way to share genetic material between the parents and to introduce new genetic material. In EAs, this is performed through crossover and mutation.

1) Crossover is how the genome of two parents are mixed together to create the offspring. It works by selecting a crossover point and then crossing over the genome between the parents to create two new offspring. We can see this by the following example:

U2_17.png

2) Mutation, on the other hand, introduces new genetic material by mutating one or more genes in the offspring's genome, e.g. by randomly flipping bits. In practice, mutation is applied mainly to poor individuals to encourage exploration, while good solutions are mutated only lightly to preserve convergence.
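One-point crossover and uniform bit-flip mutation on bit strings might be sketched like this (the list-of-bits representation and the mutation rate are assumptions of mine):

```python
import random

rng = random.Random(0)

def one_point_crossover(p1, p2):
    # Pick a random crossover point and swap the tails of the two genomes,
    # producing two hybrid offspring.
    point = rng.randrange(1, len(p1))
    return p1[:point] + p2[point:], p2[:point] + p1[point:]

def bit_flip_mutation(genome, rate=0.1):
    # Uniform mutation: each bit flips independently with probability `rate`.
    return [1 - b if rng.random() < rate else b for b in genome]

c1, c2 = one_point_crossover([0, 0, 0, 0], [1, 1, 1, 1])
```

Notice that crossover alone only rearranges the parents' bits between the two children; mutation is what introduces bit values neither parent carried at a given position.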

2.8 Stopping Conditions

Now that we’ve defined our reproduction operators, we simply repeat the entire process of selection and reproduction until the stopping conditions are met.

The four most common stopping criteria are:

A common stopping criterion is to stop when there is no change in the best fitness value after a certain number of generations. However, this is problematic when elitism is not used or the best solution is not kept each generation, as the best fitness value will fluctuate. In addition, even when elitism is used, EAs can stagnate for long stretches of generations before finding a better solution.

One could instead stop when there is no change in the mean fitness value, indicating the population has converged around an area. However, if the selection and reproduction operators favor exploration over exploitation, the mean fitness value will fluctuate with such great variance that this condition may never be met.

On the other hand, one could stop when an acceptable solution has been found, such as in neural network training when the MSE is below one, or in classification when the cross-entropy log loss is near zero. However, this can lead to exiting the algorithm prematurely when a better solution could still have been found.

Lastly, the one most commonly found in practice, is just to simulate the algorithm for many generations and terminate when the maximum generation count has been reached.

2.9 Difference in Evolutionary Algorithms

There are many different fields of evolutionary computation, including genetic algorithms, genetic programming, evolution strategies, and evolutionary programming.

2.10 Conclusion

The importance of the initial population is to ensure good initial diversity, along with the choice of population size. The fitness function in most cases will be the optimization problem itself. Next, we need to decide how to encode the chromosome and what the phenotype will be. Then we choose our selection and reproduction operators, remembering the trade-off between exploitation and exploration. Finally, we choose when our algorithm will stop.

3. Genetic Algorithms

First, we will cover the standard canonical genetic algorithm. Then we will cover floating point representation, and move straight into crossover and mutation operators. Then we will cover genetic algorithm variants, advanced topics, and how to handle different optimization problems. Throughout the post we will have a few examples given in Python code. Because this unit will be quite extensive, it will be broken down into parts.

Table of Contents (Part 1)

3.1 Canonical Genetic Algorithm

The canonical genetic algorithm is regarded as the simplest and one of the earliest genetic algorithms used in practice. It utilizes binary/bit-string representation of the genome for encoding and decoding, proportional selection through a roulette wheel, one-point crossover, and uniform mutation of the genome.

In this algorithm, the survivors are the offspring of the parents selected for reproduction. In addition, instead of always creating the offspring through mutation or crossover, each offspring has a probability of being crossed over or mutated; if it is neither, it defaults to a copy of the parent with the best fitness value. The purpose of these crossover and mutation probabilities is to prevent loss of gene information from the parents, as the offspring might be worse than the parents.

3.2 Binary vs Floating Point Representation

1) The canonical genetic algorithm utilizes binary representation, as it is more directly inspired by biology, where evolution occurs in DNA by altering chemical compounds, like flipping bits for mutation or crossing over the genome. However, binary representation is no longer widely used due to the overhead of encoding and decoding floating point numbers into binary, especially if Gray encoding is used to get around the Hamming cliff problem discussed in the previous unit.

2) Floating point representation leaves the numbers as they are, saving the time of encoding and decoding into binary; however, it is less intuitive than binary representation: how does one mutate or cross over floating point numbers? We will address this problem later on.

Lastly, one might question the accuracy of the two representations: is floating point representation better at finding solutions than binary? It has been shown that floating point representation can work as well as, and sometimes better than, standard binary encoding, so there is no reason to fret about algorithm quality if floating point representation is used.

3.3 Crossover Techniques

Crossover operators can be classified into three types: asexual, sexual, and multi-recombination.

  1. Asexual means that an offspring is generated from one parent,
  2. Sexual means two parents produce one or two children, and then
  3. Multi-Recombination where more than two parents are used to produce one or more offspring.

In addition, for the parents selected to create offspring, we apply crossover only with some probability, so as not to lose the parents' genetic information by always crossing over; higher crossover probabilities mean crossover happens more often. If crossover is not applied, the offspring is a copy of the fittest parent. In addition, if only one offspring is generated, it can replace the worst parent either directly or probabilistically, through something like Boltzmann selection.

U3_1.png

There are three main types of crossover techniques for binary encoding: one-point, multi-point, and uniform. We can see examples of these above.

  1. First in the top left, we have one point crossover where we choose a random point and switch the genome of the parents to create two hybrid offspring.
  2. Second, we have multi point crossover, depicted in the top right, where we select multiple points to crossover at.
  3. Lastly, we have uniform crossover, as depicted at the bottom, where each gene of the genome has some probability of being inherited by a parent.

There are two main types of floating point crossover techniques.

  1. The first performs crossover geometrically. Below we can see an example using multi-parent recombination:

U3_2.png

  2. The second performs crossover by switching the floating point values for each of the variables. This type of crossover has only a discrete number of possible offspring, as there is no variance involved in creating them, only the variable values present in the parents. This is a problem because we have no way to introduce new genetic material; however, we will solve this later through the use of mutation.

U3_3.png

For our examples, we will use two common crossover techniques.

  1. The first operator is a type of geometric average between the parents, formed through a linear combination of their components. When only two parents are used, it is given by:
    $$\tilde{x}_{ij}=(1-\gamma)x_{1j}+\gamma x_{2j},\quad\gamma\in[0,1]$$

    Note that $\gamma=0.5$ gives the mean; one can take $\gamma\sim N(0.5,0.15)$ or $\gamma\sim U(0,1)$. Here $x_1$ and $x_2$ denote the parents, $\tilde{x}_i$ denotes the $i$'th child, and $j$ denotes the $j$'th variable.

    This can be extended to multiple parents, where each parent has its own gamma value and the gamma values sum to one:

    $$\tilde{x}_{ij}=\sum_p\gamma_px_{pj},\quad \sum_p\gamma_p=1$$

  2. The second crossover method is what I like to call the 'intuitive' crossover method; as stated before, the offspring inherits each of its genes from one of its parents. This method can also be extended to multiple parents, where the offspring's genes are sampled from the selected parents. The question then becomes which parent each gene should be inherited from; this is called the scanning strategy. The probability of inheriting a gene from any given parent could be uniform, or it could be based on fitness values using a proportional selection method.

    Intuitive Crossover (inherit from a parent probabilistically):

    Two parents: $\tilde{x}_{ij}=x_{1j}$ or $\tilde{x}_{ij}=x_{2j}$
    Multi-parent: $\tilde{x}_{ij}=x_{pj}$
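Both floating point operators can be sketched in a few lines, assuming list-valued individuals; in the linear-combination operator a single gamma is drawn per mating, matching the two-parent formula:

```python
import random

rng = random.Random(0)

def arithmetic_crossover(p1, p2):
    # Linear-combination crossover: each child gene is the convex combination
    # (1 - gamma) * parent1 + gamma * parent2 with gamma ~ U(0, 1).
    gamma = rng.uniform(0, 1)
    return [(1 - gamma) * a + gamma * b for a, b in zip(p1, p2)]

def intuitive_crossover(p1, p2):
    # 'Intuitive' crossover: each gene is inherited from one of the two
    # parents with equal probability (uniform scanning strategy).
    return [a if rng.random() < 0.5 else b for a, b in zip(p1, p2)]

child = arithmetic_crossover([0.0, 0.0], [1.0, 1.0])
picked = intuitive_crossover([0.0, 0.0], [1.0, 1.0])
```

Note the contrast: the arithmetic child always lies on the segment between its parents, while the intuitive child can only ever contain gene values the parents already had.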

Examples of Crossover techniques

U3_4.png

Above we have a visual representation of what happens when we use the first crossover technique. This was performed by randomly selecting two individuals to mate and producing an offspring through a linear combination, where gamma was uniformly random from 0 to 1.

We can definitely see that each offspring lies between its parents, so this operator pulls the population toward the interior of the region spanned by the parents.

U3_5.png

Now, we have above a visual representation of what happens when we use the second crossover technique. This was performed by randomly selecting two individuals to mate and switching the x and y values of the two parents to create two offspring.

We can see in the pictures on the right that by generation 300 the algorithm leaves some major holes that it does not explore. This is due to the fact that this crossover technique does not introduce any new genetic material.

In our example, because no initial value had an x value of -50, none of the offspring by generation 300 explored near that region. This is why it is vital to pair crossover techniques with mutation, which we will cover next.

3.4 Mutation Techniques

Now that we’ve performed crossover on our offspring, we need a way to introduce new genetic material, this is done through mutation.

For binary representation, there are usually two main ways to perform mutation: (1) uniform mutation, where each bit has an equal chance of being flipped, or (2) randomly flipping exactly one bit.

U3_6.png

For floating point representation there are many different mutation techniques; however, I will first introduce a basic method where we add a random uniform value to each component of the offspring, drawn from a bounded range for each variable: $$x_j=x_j+U(-x_j^{bound},x_j^{bound})$$

Notice that if this bound is large it would encourage exploration while if it is small it will encourage exploitation.

The choice for the max mutation value is up to the designer, but it is usual to keep it at around 1% of the total domain space for a given variable.
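The mutation equation above might be sketched like this, with the bound taken as 1% of each variable's domain width as suggested (clamping back into the bounds is an addition of mine):

```python
import random

rng = random.Random(0)

def uniform_mutation(x, bounds, frac=0.01):
    # Add U(-d, d) noise to each gene, where d is `frac` (here 1%) of that
    # variable's total domain width, then clamp back into the bounds.
    out = []
    for v, (lo, hi) in zip(x, bounds):
        d = frac * (hi - lo)
        out.append(min(hi, max(lo, v + rng.uniform(-d, d))))
    return out

mutant = uniform_mutation([0.0, 0.0], [(-50, 50), (-50, 50)])
```

With a domain width of 100, each gene moves by at most 1 per application, producing exactly the small exploratory cluster described in the figure below.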

Example of Mutation Technique

U3_7.png

Above we have a visual representation of what happens when we use the mutation technique in floating point representation. This was performed by adding a small random value to each of the x and y direction for each individual.

We can see already that mutation is simply exploring and creating a small cluster around the point. This shows the power of mutation by introducing new genetic material.

Example of Crossover with Mutation

U3_8.png

Lastly, we have above a visual representation of what happens when we use the second crossover technique paired with mutation. This was performed by randomly selecting two individuals to mate, switching the x and y values of the two parents to create two offspring, and then adding a small random uniform value in both the x and y directions.

If you remember, values near -50 were not explored due to the initial generation lacking any points with that value; however, because we added mutation it introduced new genetic material that allowed crossover to explore that area. This shows the power of crossover paired with mutation as a viable search tool for genetic algorithms.

Example of Short Trivial Problem

Let's go over a trivial example. Suppose we want to maximize the function f(x) = -x² + 3, whose maximum occurs at x = 0 with f(0) = 3.

U3_9.png

  1. First, we create an initial population that spans the entire domain space, so we draw our initial points uniformly at random from the domain of -3 to 3.
  2. We determine the size of our generations, which directly influences the success of the algorithm.
  3. We choose our fitness function; because we want to find the maximum of our function, the fitness function will be the objective function itself, rewarding points with larger output values.
  4. We choose our genome encoding; we will use floating point representation, as it is faster to encode and decode than binary and is what is used in modern practice.
  5. We decide which individuals are selected for reproduction by pairing roulette wheel selection with elitism. For this problem we set elitism at 20%, meaning the best 20% of the generation is always carried over; however, because our population size is only 5, 20% amounts to a single individual.
  6. For reproduction, we use the average of the parents, where gamma is uniformly distributed from 0 to 1, and the maximum mutation value is 1% of the entire bound.
  7. Because we do not always want to perform crossover or mutation, as it might lose information from the parents, we perform these operations probabilistically, only 75% of the time.
  8. Our algorithm stops when it reaches the maximum number of iterations allowed.

The code:
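The original listing is not reproduced here; below is a minimal sketch of the eight steps above (function and parameter names are illustrative, not the author's):

```python
import random

def f(x):
    """Objective (and fitness) function: maximum at x = 0, f(0) = 3."""
    return -x**2 + 3

def evolve(pop_size=5, bounds=(-3.0, 3.0), generations=100,
           elite_frac=0.2, p_op=0.75, mut_rate=0.01):
    low, high = bounds
    # Step 1: uniform random initial population over [-3, 3].
    pop = [random.uniform(low, high) for _ in range(pop_size)]
    for _ in range(generations):
        fits = [f(x) for x in pop]
        # Shift fitness to be non-negative so roulette proportions work.
        shift = min(fits)
        weights = [fit - shift + 1e-9 for fit in fits]
        # Step 5: elitism — the best 20% (one individual here) survives.
        n_elite = max(1, int(elite_frac * pop_size))
        children = sorted(pop, key=f, reverse=True)[:n_elite]
        while len(children) < pop_size:
            # Roulette wheel selection of two parents.
            p1, p2 = random.choices(pop, weights=weights, k=2)
            child = p1
            if random.random() < p_op:      # Step 7: crossover 75% of the time
                gamma = random.random()     # Step 6: averaging crossover
                child = gamma * p1 + (1 - gamma) * p2
            if random.random() < p_op:      # Step 7: mutation 75% of the time
                step = mut_rate * (high - low)   # 1% of the bound
                child += random.uniform(-step, step)
            children.append(min(max(child, low), high))
        pop = children                      # Step 8: run for max iterations
    return max(pop, key=f)
```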


3.5 Control Parameters

There are three types of hyperparameters that influence the quality of the algorithm:

Static hyperparameters are just like they sound, they are static throughout the entire evaluation of the algorithm. Large static values encourage exploration (except for elitism), while small static values encourage exploitation (except for elitism).

Dynamic hyperparameters decrease from large values to small values over the course of the algorithm’s execution. This allows exploration during the early stages of the algorithm and exploitation later on. The problem with dynamic hyperparameters is that the parameters will decrease no matter what as each generation passes, even if they are performing well, while the problem with static hyperparameters is that they never change.

Self-Adaptive hyperparameters change only depending upon how successful they are: if the max mutation bound value is too low it will self-adapt and increase, or decrease if it is too high. These parameters are an integral part of Unit 5.

For Dynamic hyperparameters, the most common way to handle the decrease is through either linear or non-linear functions.

  1. For Linear functions, you give the algorithm the maximum and minimum hyperparameter values and it creates a straight line of values from max to min over the given number of generations.
  2. Non-linear functions perform this decrease in value through non-linear fashions. The most common non-linear function is Logistic Decay: $$f(t)=\frac{2\cdot p_{max}^2\cdot e^{r\cdot t}}{p_{max}+p_{max}\cdot e^{r\cdot t}}$$

Where $p_{max}$ is the maximum value for the parameter, $t$ is the generation, and $r<0$ is the rate of decay. Smaller (more negative) values for $r$ encourage exploitation through faster decay, while larger values for $r$ (closer to zero) encourage exploration.

Because Logistic Decay is logistic in nature, as the generation count approaches infinity the parameter value approaches zero, which leads to stagnation and premature convergence. To prevent this in practice, we implement a max operation: if the new parameter value falls below some minimum, it is set to that minimum.

U3_10.png
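The two schedules can be sketched as follows, using the simplified form of the logistic formula above together with the minimum-value floor (names are illustrative):

```python
import math

def linear_decay(p_max, p_min, t, t_max):
    """Straight line of values from p_max down to p_min over t_max generations."""
    return p_max + (p_min - p_max) * t / t_max

def logistic_decay(p_max, p_min, t, r=-0.1):
    """Logistic decay with negative rate r, floored at p_min so the
    parameter never decays all the way to zero (avoiding stagnation)."""
    value = 2 * p_max * math.exp(r * t) / (1 + math.exp(r * t))
    return max(value, p_min)
```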

3.6 Genetic Algorithm Variants

3.6.1 Generation Gap Methods

Generation Gap Methods are variants that implement different survival selection methods for determining who survives into the next generation. There are three popular methods for this approach:

In Parent-Offspring battles, the pair of parents face off against their set of offspring, where the offspring are accepted probabilistically based upon their fitness value. The most common way to handle this is through Boltzmann Selection, which we will cover in the next post.

Like in the first method, we always create offspring (removing the need for crossover and mutation probabilities) and pool them together with the parents. After pooling the parents and offspring together, we use tournament selection for survival, except that the worst fit individual is killed off, repeating until the pooled population shrinks back down to its original size. The benefit of this procedure is that the battle is not directly between parent and offspring pairs but within the total pooled population, as the offspring of one set of parents could be worse than its own parents yet better than another set of parents.

Elitism strategies are like the second method, pooling all the parents and offspring together and then sorting them based on fitness or age. If sorting by fitness, this method resembles elitism in that the best half of the sorted individuals is chosen to survive. Alternatively, one could sort by age and fitness, where younger individuals are preferred over older ones; age is denoted by how many generations an individual has survived. The benefit of this procedure is that the algorithm does not hold onto old solutions, which could cause stagnation; but the problem then becomes that the algorithm will reject good old solutions in favor of new ones even if they are worse.

3.6.2 Growing Population

The population size of the algorithm dynamically grows to model population sizes in nature.

There are two common ways to do this.

  1. One is through logistic decay and growth. Starting at our initial population size, we logistically grow or shrink depending upon certain circumstances. For example, if our algorithm is converging we might grow the population to allow for more exploration, while shrinking it if the algorithm is not converging to encourage exploitation. The main problem with this procedure is determining exactly when to decay or grow, and by how much, which adds more hyperparameters to the problem.
  2. ?

3.6.3 Running in Parallel

Running in Parallel variants are Genetic Algorithms designed to run in parallel.

In industry, running Genetic Algorithms on real world problems can be extremely time consuming and power hungry. We can reduce the computational cost by running our algorithm in parallel. This can be done in three ways: evaluate the fitness function in parallel, assign a processor per individual, or assign subpopulations per processor.

3.6.4 Island Algorithms

U3_11.png

Island Algorithms, as we can see above, are closely related to Running in Parallel variants in that they are usually run on multiple processors, where each processor represents an island; however, these algorithms can also be implemented on a single processor.

In Island Algorithms we have what are known as migration policies, which contain initialization, communication topology, migration rate, migration selection, and replacement selection.

Migration and replacement selection can be performed using any of the selection strategies discussed thus far in the series. The purpose of migration is to introduce new genetic material to an island.

3.7 Selective Mating through Clustering

One of the main problems with crossover is that it may lead to poor individuals. Take for example the following function below:

U3_12.png

Problem: we can see from the topology that there are four equal global minima, one at each corner of the domain. If our algorithm has two individuals with similar fitness scores but at different locations, and we were to cross over using the averaging strategy, our offspring would end up in the middle at a poor location.

This is where the parent-offspring battle would be a useful mechanism, as the parent would not be replaced due to the inadequacy of the offspring. If we did not have this component, we could instead implement selective mating through clustering.

Selective mating means that individuals are only allowed to mate with individuals within the same cluster, thus preventing two individuals with a far Euclidean distance from mating.

The only problem with this line of thinking is that it can lead to poor exploration, since individuals only mate and reproduce within the same cluster, reducing the global sharing of genetic material.

Selective mating can easily be implemented with the island algorithm as only individuals within the same island are allowed to mate and reproduce. The only problem is determining the number of islands and who goes where. A solution to the number of islands is to use some type of criterion, like Silhouette score in K-means clustering.

As choosing the cluster count for K-means clustering is complicated, we can choose the best number of clusters with the silhouette method. This method starts with two clusters and increases the count, computing for each count a silhouette score that measures how close each point is to its own cluster compared to the neighboring clusters.

The silhouette score lies between -1 and 1, where 1 means the clusters are very dense with no overlap, as we can see in the picture on the right hand side, and -1 means points likely belong to the wrong cluster. We keep increasing the cluster count up to our maximum choice and select the count with the best score.

This can be used on the initial population to divide it into clusters to be partitioned in the island algorithm.
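To make the idea concrete, here is a from-scratch sketch of choosing the cluster count by silhouette score; in practice one would typically use a library implementation such as scikit-learn's `KMeans` and `silhouette_score`:

```python
import math
import random

def dist(a, b):
    """Euclidean distance between two 2-D points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def kmeans(points, k, iters=50):
    """Plain k-means on 2-D points; returns a cluster label per point."""
    centers = random.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: dist(p, centers[j]))
                  for p in points]
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # move each center to the mean of its members
                centers[j] = tuple(sum(c) / len(members)
                                   for c in zip(*members))
    return labels

def silhouette(points, labels):
    """Mean silhouette score in [-1, 1]: (b - a) / max(a, b) per point."""
    total = 0.0
    for i, p in enumerate(points):
        own = [dist(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        a = sum(own) / len(own) if own else 0.0
        other_means = [
            sum(dist(p, q) for j, q in enumerate(points) if labels[j] == l)
            / labels.count(l)
            for l in set(labels) if l != labels[i]
        ]
        b = min(other_means) if other_means else 0.0
        total += (b - a) / max(a, b) if max(a, b) > 0 else 0.0
    return total / len(points)

def best_cluster_count(points, k_max=6):
    """Try k = 2..k_max and keep the count with the highest score."""
    return max(range(2, k_max + 1),
               key=lambda k: silhouette(points, kmeans(points, k)))
```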

Note that a uniform initial population will not work with K-means clustering; therefore, when uniform initial populations are used, we can instead divide the domain space into squares, give each of those subdomains to an island, and then use K-means clustering to choose who reproduces with whom.

U3_14.png

3.8 Constraint Handling

Recalling Unit 1, there are four types of optimization problems:

  1. Unconstrained
  2. Constrained
  3. Multi-Objective
  4. Multi-Solution

Unconstrained problems are easy, we’ve been doing them this entire time.

However, one thing we have not covered is problems where the domain is discrete. The most popular way to handle this is to change nothing in the representation and simply round the variable to the nearest integer when it comes time to evaluate the fitness function.

For constrained problems, the common solution is simply to add penalty terms for solutions that are infeasible, or to change the reproduction and initialization operators so that no infeasible solution is created.
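A minimal sketch of the penalty-term approach (the penalty weight and the g(x) ≤ 0 constraint convention are assumptions, not a prescription from this series):

```python
def penalized_fitness(fitness, constraints, penalty=1e3):
    """Static penalty method for a maximization problem.

    `constraints` is a list of functions g(x) expected to satisfy
    g(x) <= 0; any positive value is a violation and is penalized
    in proportion to its magnitude.
    """
    def wrapped(x):
        violation = sum(max(0.0, g(x)) for g in constraints)
        return fitness(x) - penalty * violation
    return wrapped
```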

For multi-objective problems, one can create an aggregate function by simply adding all objective functions together and minimizing the total; or one could use Pareto dominance, where the parents and offspring are pooled together, a tournament style selection is used, and the most dominant individual is chosen to survive.

Multi solution problems can be solved by using a form of an island algorithm. In multi solution problems the global min or max is usually known but the goal is to find every location where this occurs.

Therefore, in our island algorithm, each time a new individual is spawned whose fitness value is close or equal to the global min or max, but whose Euclidean distance from the mean of its island exceeds some value, it migrates to the island whose mean is closest.

However, if no island exists whose mean is within that distance of the individual, it creates its own island and reproduces asexually until new offspring or new migrants are present.

In this way, island algorithms can solve multi solution problems by having each island be its own cluster with the same global min or max solution. After running for the generation count, the number of islands represents the number of different positions found.

3.9 Evolving Neural Network for Time Series Analysis

The much anticipated application of evolving the weights of a Neural Network for Time Series Analysis!


3.9.1 Time Series Problem

Time series analysis refers to the analysis of any series of data points ordered in time.

Time series analysis is extremely applicable in many fields, from economics to weather projection and forecasting.

The problem we will be analyzing is a Sunspot Time Series dataset. Sunspot cycles are cycles of solar magnetic activity occurring on the Sun. The monthly mean number of sunspots has been recorded since 1749. Our problem is to see how well we can model this time series and predict future monthly mean counts. Here we have a view of what the dataset looks like over its entire lifespan:

U3_15.png

Data is from Kaggle.

3.9.2 Review of Neural Networks

We will be attempting to solve this problem through a Feed Forward Neural Network.

Artificial neural networks were inspired by biological neural networks, the basic units of the brain. These models can perform classification, regression, and other tasks, and have become extremely common in Machine Learning and Artificial Intelligence due to their unexpected success across many problem types.

The architecture of a neural network can be split into three components:

The weights between each pair of layers are represented by a matrix.

U3_16.png

We represent each connection between two neurons by a weight, which can be combined to form a matrix. A forward pass through our network is simply multiplying the matrices together, applying our transition function at each layer, and then getting our result.

The bias is a [1, M] vector added to each observation after the activation function, where M is the number of nodes in that layer.

Neural networks are commonly trained through a form of backpropagation; where the current error, either MSE or Cross-Entropy, is propagated back from the output layer throughout the hidden layers to adjust the weights such that they minimize the error. However, in our application, instead of using backpropagation we will use a genetic algorithm to train the weights.

3.9.3 Design of the Neural Network in case of Time Series

The most common implementation for solving time series problems in neural networks is by using a recurrent neural network.

A recurrent neural network is a network where output from a recurrent layer is fed back into the input layer for the next value at the subsequent time index.

However, an easy way to avoid a recurrent implementation is to simply make the input layer a ‘window’ that spans some number of indices before the current value.

Example

The goal is to predict the values in the black boxes using the values in the gray boxes that precede them; the gray boxes make up the prediction window. Here the window size is 3, though in practice this value has to be tuned experimentally.

U3_17.png
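Building the windowed training pairs can be sketched as follows (names are illustrative):

```python
def make_windows(series, window=3):
    """Turn a time series into (input window, next value) training pairs."""
    X, y = [], []
    for i in range(len(series) - window):
        X.append(series[i:i + window])   # the gray boxes
        y.append(series[i + window])     # the black box to predict
    return X, y
```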

3.9.4 Implementing the Neural Network

Because we will be evolving the weights of a neural network, we need to implement one from scratch as it would require a lot of maneuvering to do so in common Python implementations. We will implement our network to accommodate an arbitrary number of layers and nodes per layer, but will only have the ReLU activation function for each layer.
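A sketch of such a from-scratch network is below. It mirrors the `EvolvableNetwork` class referenced later in this post, but the exact implementation here is an assumption: ReLU is applied to the hidden layers, the output is left linear for regression, and plain nested lists are used so a genetic algorithm can manipulate the weights directly.

```python
import random

class EvolvableNetwork:
    """Minimal feed-forward network with ReLU hidden layers.

    `layers` gives the node count per layer, e.g. [3, 8, 1].
    With initialize=False the weight and bias matrices are left
    empty, to be filled in by reproduction operators.
    """
    def __init__(self, layers, initialize=True):
        self.layers = layers
        self.weights, self.biases = [], []
        if initialize:
            for n_in, n_out in zip(layers, layers[1:]):
                self.weights.append(
                    [[random.gauss(0, 0.5) for _ in range(n_out)]
                     for _ in range(n_in)])
                self.biases.append([0.0] * n_out)

    def predict(self, x):
        """Forward pass: matrix multiply, add bias, apply ReLU."""
        for layer, (W, b) in enumerate(zip(self.weights, self.biases)):
            x = [sum(xi * W[i][j] for i, xi in enumerate(x)) + b[j]
                 for j in range(len(b))]
            if layer < len(self.weights) - 1:   # ReLU on hidden layers only
                x = [max(0.0, v) for v in x]
        return x
```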

3.9.5 Training, Validation and Testing Data Sets of Time Series

The most common pitfall for all machine learning models is over-fitting. This occurs when a model has so many tunable parameters that it starts to memorize the input; as a result, it performs worse when predicting values it has never seen before than on the values it was trained upon.

The goal of Machine Learning and Data Science is to create simple yet powerful models that can adapt and interpolate new values accurately.

There are many different ways to prevent overfitting in neural networks, from dropout layers to cross-validation; however, the most common is simply early stopping.

We use our training dataset to train the model and then test its performance on the testing dataset: we train on the training set and, once we start to see overfitting (meaning the error on the testing set starts to increase), we stop the algorithm. This is known as early stopping.

U3_18.png

In this setup, however, we have already used the test set during training, leaving nothing to test on after tuning hyperparameters. To get around this, we introduce a validation dataset, which replaces the testing dataset in early stopping.

We now split our dataset into three parts:

  1. We train our model on the training set.
  2. Then we perform early stopping, parameter tuning, and model comparison using the validation set.
  3. After, and only after, we’ve chosen our final model, along with its final architecture, do we evaluate it on the testing dataset.

By doing this, we ensure that our model has never seen the values in the testing dataset in order to get an accurate evaluation of our model.

3.9.6 The Genetic Algorithm

Now it’s time to discuss the exact implementation details of our evolutionary algorithm.

A. As discussed in the previous posts, each individual is made up of its genotype and phenotype.

In our problem, the genotype is made up of weight and bias matrices, where each matrix is a gene in the genome. For the phenotype, the weight and bias matrices compiled together form a neural network in the environment.

B. Now that we know how we will encode our chromosome, it is time to discuss the reproduction operators. For selection, we will use roulette wheel selection, which works by creating a cumulative distribution from the proportion of an individual being chosen based off its fitness value:
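A sketch of roulette wheel selection via the cumulative distribution (assuming non-negative fitness values):

```python
import random

def roulette_select(population, fitnesses):
    """Pick one individual with probability proportional to its fitness,
    by walking the cumulative distribution of fitness proportions."""
    total = sum(fitnesses)
    r = random.uniform(0, total)
    cumulative = 0.0
    for individual, fit in zip(population, fitnesses):
        cumulative += fit
        if r <= cumulative:
            return individual
    return population[-1]  # guard against floating-point round-off
```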

C. For crossover, we will implement the averaging technique, which takes a linear combination of the parent values. For our problem, the offspring weight and bias matrices will just be a linear combination of the parent’s matrices. To do this, we first instantiate a new EvolvableNetwork, except this time with initialize equal to False, because we do not want the weight and bias matrices to be initialized to random values as we will create them from the parents:

D. For mutation, we will simply add some small random value to each entry of the weight and bias matrix for all matrices:

Unlike in our previous Genetic Algorithm implementations, this particular implementation will not have hyperparameters for probabilities of mutation, crossover, or elitism.

E. Instead, we can reduce the complexity of tuning our algorithm by getting rid of these parameters. To do this, our set of parents will create a set of four children, where the children will be pooled along with their parents and the individual with the best fitness will be chosen to survive. For the offspring, all four will be created through crossover by different coefficient values, and then the last two cross-over offspring will be mutated by different random values.

Here is what our reproduce function will look like for two parents:
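Since the original listing is omitted, here is a hedged sketch of the reproduce step described above, treating each genome as nested lists of weight and bias values (helper names are illustrative, not the author's):

```python
import random

def crossover(p1, p2, gamma):
    """Element-wise linear combination of two nested weight structures."""
    if isinstance(p1, list):
        return [crossover(a, b, gamma) for a, b in zip(p1, p2)]
    return gamma * p1 + (1 - gamma) * p2

def mutate(genome, scale=0.1):
    """Add a small random value to every entry of the nested structure."""
    if isinstance(genome, list):
        return [mutate(g, scale) for g in genome]
    return genome + random.uniform(-scale, scale)

def reproduce(parent1, parent2):
    """Create four offspring: all four by averaging crossover with
    different coefficients, the last two additionally mutated."""
    children = [crossover(parent1, parent2, random.random())
                for _ in range(4)]
    children[2] = mutate(children[2])
    children[3] = mutate(children[3])
    return children
```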

F. Now it is time for our fitness function; it will be the Mean Squared Error (MSE) between the predicted and actual values of our time series: $$MSE = \frac{1}{n}\sum_{i=1}^n (y_i-\hat{y}_i)^2$$

Because we want to minimize the MSE, we need to scale our fitness values such that smaller errors yield larger fitness and larger errors yield smaller fitness, and then maximize that scaled fitness. We can do this through the following function:
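The exact scaling function is not shown here; one common choice with this shape (an assumption, not necessarily the post's function) is $1/(1+MSE)$:

```python
def scaled_fitness(mse):
    """Map an MSE in [0, inf) to a fitness in (0, 1], where a smaller
    error yields a larger fitness. This 1/(1 + mse) form is an assumed
    example of such a scaling, not necessarily the original post's."""
    return 1.0 / (1.0 + mse)
```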

Relationship between original fitness values and scaled fitness values:

U3_19.png

Next, we need to split our data into training, validation, and testing sets. We will train our algorithms on the training dataset and use the validation set to compare models and to prevent overfitting through early stopping. Lastly, after we’ve chosen our final model, we will evaluate its accuracy on the test dataset.

We will pass our training and validation data to our evolution algorithm. As stated before, we will test different window sizes, so we will loop over all possible window sizes, create the data, and test our algorithm:

It is time to define the body of the evolutionary algorithm:

For evaluation, we will find the model with the best validation score from each window size and then recreate the data based off that window size to evaluate our test MSE score.

3.9.7 Evaluating the Algorithm

7. DIFFERENTIAL EVOLUTION

- AUTOMATED MACHINE LEARNING


7.1 Difference between DE and standard GA

Differential Evolution differs from standard genetic algorithms in that new candidate solutions are built from scaled difference vectors between existing individuals, and an offspring replaces its parent only if it has better fitness.

The general flow of the DE algorithm can be depicted as follows:

U7_1.png

Compare it to the flow of the general GA:

U2_3.png

7.2 General Mutation and Crossover Operators

Mutation:

The mutation operator in Differential Evolution - the unit vector - is created from two components: a target vector and a scaled difference vector.

An exact definition below:

$$unit = target + scale\cdot(difference)$$

$$u_i = x_{i1}+\beta(x_{i2}-x_{i3})$$

where $i\neq i1 \neq i2 \neq i3$, $u_i$ is the unit vector, $x_{i1}$ is the target vector, $x_{i2}$ and $x_{i3}$ are the selected parents, and $\beta$ is a scaling coefficient.

We can see the geometric power of the unit vector below:

U7_2.png

In the figure, there are three parent vectors, X1, X2, X3. We create the difference vector between X2 and X3, then scale it by beta. After scaling, we add it to our target vector, X1, to obtain the final unit vector, U.

The unit vector, the final product of the mutation process, is passed along to be crossed over with the current ‘major’ parent, the parent of interest not included in the mutation operation.

Crossover:

As in Genetic Algorithms there are two types of crossover techniques in Differential Evolution: average and intuitive.

After creating the unit vector, we select the 'major' parent (different from the ones used before), and then perform crossover between this new parent and the unit vector to obtain the final offspring.

Example:

U7_3.png

In the example we can see how the coordinates of the unit vector $U(x_u,y_u)$ and the chosen 'major' parent $P(x_p,y_p)$ are combined to produce the offspring: child 1 $C_1(x_p,y_u)$ and child 2 $C_2(x_u,y_p)$.
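Putting mutation and crossover together, one DE generation might be sketched as follows. Note this uses a simple per-coordinate random mix (binomial-style crossover) instead of the two-child coordinate swap illustrated above, and greedy parent-offspring survival; names are illustrative:

```python
import random

def de_step(population, fitness, beta=0.5):
    """One DE update: for each 'major' parent, build a unit vector
    u = x1 + beta * (x2 - x3) from three other random individuals,
    cross it with the parent, and keep the better of parent/child."""
    new_pop = []
    n = len(population)
    for i, parent in enumerate(population):
        i1, i2, i3 = random.sample([j for j in range(n) if j != i], 3)
        x1, x2, x3 = population[i1], population[i2], population[i3]
        unit = [a + beta * (b - c) for a, b, c in zip(x1, x2, x3)]
        # Each coordinate comes from either the unit vector or the parent.
        child = [u if random.random() < 0.5 else p
                 for u, p in zip(unit, parent)]
        new_pop.append(child if fitness(child) >= fitness(parent) else parent)
    return new_pop
```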

7.3 Creation of the Unit Vectors

There are many different strategies for creating unit vectors, known as the DE/x/y/z Strategies, where x denotes the chosen target vector, y denotes the number of difference vectors, and z represents the crossover method.

Here are the most common strategies:

DE/Best/1/z: This method uses only one difference vector, where the parents are randomly selected, and whose target vector is the best individual in the current population: $$u_i=\hat{x}+\beta(x_{i2}-x_{i3})$$

DE/Rand/n/z: This method works by selecting a random target vector and n randomly selected difference vectors: $$u_i=x_{i1}+\beta\sum_{k=1}^{n_v}(x_{i2,k}-x_{i3,k})$$

DE/Rand-to-Best/n/z: This method works by incorporating a linear combination between the best individual and a random parent for creating the target vector, and n randomly selected difference vectors: $$u_i=\gamma\hat{x}+(1-\gamma)x_{i1}+\beta\sum_{k=1}^{n_v}(x_{i2,k}-x_{i3,k})$$

DE/Current-to-Best/n/z: This method works by incorporating the ‘major’ parent along with the best individual for creating the target vector, and n randomly selected difference vectors: $$u_i=x_i+\beta(\hat{x}-x_i)+\beta\sum_{k=1}^{n_v}(x_{i2,k}-x_{i3,k})$$

Using DE/Best emphasizes exploitation, while DE/Rand emphasizes exploration. A compromise between the two can be made through the use of Combined Strategies.

7.4 Parameter strategies

We can choose between the following parameter strategies:

The major parameter in DE is the scaling coefficient $\beta$, which usually takes values in $[0,\infty)$:

It is standard to set beta around 0.5 when using one difference vector and to decrease it in proportion to the number of difference vectors.

7.5 Application of Automated Machine Learning

In this application, we evolve the architecture of a Convolutional Neural Network for classifying images from the CIFAR-10 dataset.

7.5.1 Problem statement

Dataset: 60,000 small 32x32 RGB color images from CIFAR-10, with 10 possible categories.

Model: a Convolutional Neural Network for categorizing the images of the dataset

U7_4.png

https://www.cs.toronto.edu/~kriz/cifar.html

Automated Machine Learning is a field of Optimization and Machine Learning theory where the goal is to find the best set of hyperparameters for a given algorithm.

Why use DE for Automated Machine Learning?

A critical assumption is that the search space of our architecture has only one or a few local optima. On the other hand, if you have multiple processors and GPUs that can spread out the computation, then using a standard genetic algorithm would be beneficial.

Another alternative is Evolutionary Programming, which avoids the main pitfall of crossover, the Permutation Problem, where offspring are extremely poor despite their parents being extremely good. Evolutionary Programming does not use crossover, only mutation, which allows for speciation.

7.5.2 Creating the Model Architecture: Modularization

The CNN will be split into two sections, a convolution network and a deep network, where each is composed of its respective modules.

U7_10.png

  1. Convolution network: each convolution module can contain the following layers: Convolution, Pooling, Batch Normalization, Activation Function, and Dropout.

U7_7.png

The algorithm will have the ability to

The possible choices per layer in each individual convolution module:

U7_6.png

The table can be read as follows:

etc.

  2. Deep network: for the deep modules, the possible layers are Dense, Batch Normalization, and Dropout.

U7_8.png

Again, the algorithm will have the ability to

Here are the possible choices per layer in each individual module:

U7_9.png

7.5.3 Permutation Problem

The permutation problem, otherwise known as the competing conventions problem, specifically refers to the problem of performing crossover with two or more parents to obtain offspring.

The permutation problem arises when the phenotypes of the network architectures between the parents are so different that much of the genetic information is lost during the crossover process, often producing an offspring that is extremely poor.

For example:

U7_11.png

In this example, we evolve the weights, number of layers, number of nodes, and activation functions per layer for a population of networks.

The problem is that the architectures between the parents are so different that when we perform crossover, we get a hybrid offspring with a brand new architecture.

The general solutions for the permutation problem are:

The solution for the current problem is to cross over only networks with exactly the same architecture but a different optimizer or activation function. For simplicity, there will be no evolving or speciation of optimizers and activation functions; these will be static.

7.5.4 Differential Evolution Algorithm

Dataset:

Training:

In this way, training time can be reduced many times over while the power of the evolutionary search process is preserved.

Algorithm:


The goal of the algorithm is to find the best possible CNN for classifying the CIFAR-10 images into 10 classes.

Initial population: a population of 20 CNNs with similar architectures.

Selection for Multi-Parent Reproduction: one parent (CNN) is the target vector, two parents are used for the difference vector, and the final parent is the 'major' parent, which has the best fitness.

Mutation: Creating the Unit Vector: for every individual (CNN) in the generation, the unit vector is created using the DE/Rand/n/z strategy with only one difference vector: $$u_i=x_{i1}+\beta(x_{i2}-x_{i3})$$

Crossover 'Major' Parent with Unit Vector to create Offspring: the 'major' parent is crossed over with the unit vector, using the averaging technique, to obtain offspring (new CNNs).

Select best from Child and Parent: if the architecture of the child (CNN) yields better validation accuracy than the 'major' parent, then the parent is replaced. Validation accuracy is produced by the fitness function, which trains and validates each individual (CNN) in the generation.

7.5.5 Results

7.5.6 Code

CONCLUSION